Urban Heat Island Predictor
Spring 2025 Data Science Project
Team: Nithin Nambi, Aditya Koul, Arnav K, Archer Sariscak
Contributions
Nithin Nambi – I wrote the header, came up with the topic idea, wrote the introduction, completed checkpoint 1, wrote the building footprint analysis in data exploration and the graphs along with that.
Aditya Koul – I performed the exploratory data analysis and visualization of the UHI data. I also picked which of the three datasets to use for learning and what ML model to use. I created the model and wrote a few conclusions and results after completing training. Finally, I converted the notebook to HTML and uploaded it to my GitHub Pages site for hosting.
Arnav K – I performed the data curation of NY Mesonet weather data. I documented the data sources and transformation processes for all three datasets. For weather analysis, I compared temperature patterns between Bronx and Manhattan stations, creating statistical tests and visualizations.
Archer Sariscak – I created the visualization of the model accuracy. Additionally, I wrote the section explaining how someone else could use the model with example data. Finally, I wrote the conclusion and insights that we found through our model and analysis.
Introduction
Anyone who’s spent a summer night in Manhattan knows the city can feel like a giant oven. Thanks to the Urban Heat Island (UHI) effect, urban areas end up much hotter than the surrounding countryside. If we don’t do something about UHIs, heat waves get nastier, power grids strain under the extra cooling load, and people with health issues face real danger.
For this project, we’re building a model to predict ground temperatures across NYC by combining four key data sources:
Ground readings from July 24, 2021
Sentinel-2 satellite imagery
Building‐footprint geometry
High-resolution weather observations
Our main questions are:
How close can we get to the actual ground temperatures using these inputs?
Which factors really drive the UHI effect at the neighborhood level?
These answers are very important because city planners, public‐health teams, and energy managers need accurate forecasts to zero in on cooling interventions, optimize building designs, and brace for extreme-heat events. Plus, our work plugs into an industry challenge on AI for urban heat mapping. So we’re both solving a real‐world problem and pushing the research on climate-resilient cities forward.
Data Curation/Preprocessing
Our project examines the Urban Heat Island (UHI) effect in New York City using three complementary datasets from the EY Open Science AI and Data Challenge (https://challenge.ey.com/challenges/the-2025-ey-open-science-ai-and-data-challenge-cooling-urban-heat-islands-external-participants/data-description).
The Building Footprint Data, provided by NYC's Office of Technology and Innovation, contains polygonal outlines of 9,436 buildings in our study area. This spatial data helps us understand how urban density impacts local temperatures. Structures absorb heat differently than vegetation, creating heat pockets that contribute to the UHI effect we're studying.
Our weather data comes from the New York State Mesonet network, with measurements from stations in both the Bronx and Manhattan. Collected at 5-minute intervals throughout July 24, 2021 (from 6:00 AM to 8:00 PM), it includes essential meteorological variables like air temperature, humidity, wind conditions, and solar radiation. Having data from two locations with different urban characteristics allows us to directly observe temperature differentials.
The UHI Index Training Data, developed by CAPA Strategies, provides our ground measurements of heat intensity. This dataset contains 11,229 georeferenced UHI index values, derived from temperature readings collected between 3 and 4 PM (15:01–15:59) on July 24, 2021, covering upper Manhattan and parts of the Bronx. These measurements serve as our target variable for modeling neighborhood-level heat patterns.
In the code below, we transform these datasets into analysis-ready formats:
For building data, we convert coordinates to metric units (EPSG:3395) and calculate areas to quantify urban density.
The weather data undergoes timestamp standardization (removing EDT indicators and converting to datetime objects) and hourly aggregation to reveal temporal patterns in temperature differences between the Bronx and Manhattan.
For the UHI data, we convert string timestamps to proper datetime objects and analyze the distribution of UHI index values across the city, which range from 0.956 to 1.046 (mean: 1.000001).
These preprocessing steps allow us to investigate how urban structure and weather conditions contribute to temperature variations in different neighborhoods.
# Import statements
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# File paths
building_footprint_kml = '../training_data/Building_Footprint.kml'
NY_weather_xlsx = '../training_data/NY_Mesonet_Weather.xlsx'
UHI_data_csv = '../training_data/Training_data_uhi_index_2025-02-18.csv'
Data Exploration and Summary Statistics
import geopandas as gpd
gdf = gpd.read_file(building_footprint_kml, driver='KML')
df = pd.DataFrame(gdf)
print(df.head())
# In the first few rows, we can see that the Name and Description columns are all the same.
# The geometry column contains geometrical data representing the building footprints, in the form of MULTIPOLYGON.
  Name Description                                           geometry
0                   MULTIPOLYGON (((-73.91903 40.8482, -73.91933 4...
1                   MULTIPOLYGON (((-73.92195 40.84963, -73.92191 ...
2                   MULTIPOLYGON (((-73.9205 40.85011, -73.92045 4...
3                   MULTIPOLYGON (((-73.92056 40.8514, -73.92053 4...
4                   MULTIPOLYGON (((-73.91234 40.85218, -73.91247 ...
print(df.dtypes)
# The Name and Description columns are objects (strings), while geometry is a geometry type,
# which is expected as this column is geospatial data (polygonal geometries of the buildings).
Name             object
Description      object
geometry       geometry
dtype: object
print(df.describe(include=[object]))
# With describe we are able to see that the Name and Description columns each have only one unique value.
# Since every entry is the same, we can ignore these columns for most types of analysis.
        Name Description
count   9436        9436
unique     1           1
top
freq    9436        9436
print(df.info())
# The dataset contains 9436 entries and 3 columns. All columns are non-null,
# meaning there are no missing values in any of them. This is good because it ensures
# that our analysis will not be affected by missing data.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9436 entries, 0 to 9435
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Name         9436 non-null   object
 1   Description  9436 non-null   object
 2   geometry     9436 non-null   geometry
dtypes: geometry(1), object(2)
memory usage: 221.3+ KB
None
missing_data = df.isna().sum()
print("Missing data:", missing_data)
# There is no missing data in the dataset, as expected, since all columns show 0 missing values.
# This indicates that we have a clean dataset in terms of completeness.
Missing data: Name           0
Description    0
geometry       0
dtype: int64
gdf = gdf.to_crs(epsg=3395)  # EPSG:3395 (World Mercator) uses meters as units; note it is conformal, not equal-area
gdf['area'] = gdf.geometry.area
print(gdf[['Name', 'area']].head())
# The areas of the buildings vary significantly, with values like 1080.60 m², 166.11 m², and so on.
# This suggests that the dataset includes buildings of different sizes.
  Name         area
0       1080.601783
1        166.114638
2        246.325998
3        138.914032
4        376.844794
gdf['area'].max()
# The building with the largest area is about 124,278 m².
124277.6496267651
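One caveat worth flagging: EPSG:3395 (World Mercator) preserves shape, not area, so footprint areas computed in it are inflated by roughly sec²(latitude). A minimal sketch of that inflation factor (the latitude value below is our approximation of the study area, not taken from the dataset):

```python
import numpy as np

# Mercator linear scale at latitude phi is 1/cos(phi); areas scale by its square.
lat_nyc = np.radians(40.81)               # approximate latitude of the study area
area_inflation = 1 / np.cos(lat_nyc) ** 2
print(f"EPSG:3395 area inflation at NYC: {area_inflation:.2f}x")

# A Mercator-projected area can be roughly corrected by dividing the factor out:
projected_area = 1080.60                  # m^2, first building in the table above
corrected_area = projected_area / area_inflation
print(f"Corrected area: {corrected_area:.1f} m^2")
```

For truly accurate areas, reprojecting to a local UTM zone (e.g. EPSG:32618 for New York) or an equal-area CRS would be the safer route; the relative comparisons between buildings above are unaffected since all footprints sit at similar latitudes.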
Conclusion: From the initial exploration, we found that the dataset contains only a small amount of variation in Name and Description. These columns don't contribute much to further analysis. We also confirmed that there is no missing data. Now, we should focus on exploring the geometrical data (building footprints), which likely holds the most relevant information for further analysis.
Hypothesis Testing
mean_area = gdf['area'].mean()
print(f"Mean area of the buildings: {mean_area} square meters")
# The mean area of the buildings is approximately 3479.11 m².
# This provides a reference value for our hypothesis testing.
Mean area of the buildings: 3479.1095318903926 square meters
from scipy import stats
t_area = 3500
t_stat, p_value = stats.ttest_1samp(gdf['area'], t_area)
print(f"T-statistic: {t_stat}, P-value: {p_value}")
T-statistic: -0.36877837220982873, P-value: 0.7123012012019707
if p_value < 0.05:  # Using the t_area variable in the messages since we experimented with different test values
    print(f"We reject the null hypothesis: The average area is significantly different from {t_area} square meters.")
else:
    print(f"We fail to reject the null hypothesis: The average area is not significantly different from {t_area} square meters.")
# The t-statistic is -0.37, and the p-value is 0.71.
# The t-statistic indicates that the sample mean is slightly below the hypothesized 3500 m².
# Since the p-value is much greater than 0.05, we fail to reject the null hypothesis.
We fail to reject the null hypothesis: The average area is not significantly different from 3500 square meters.
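As a sanity check, the one-sample t-statistic can be computed by hand: t = (x̄ − μ₀) / (s / √n), where s is the sample standard deviation. A sketch on synthetic data (the values are illustrative, not the actual building areas):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=3450, scale=900, size=500)  # synthetic "areas"
mu0 = 3500                                          # hypothesized mean

# Manual t-statistic: (sample mean - hypothesized mean) / standard error
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(len(sample)))
t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}")
```

The two values should agree to floating-point precision, which confirms our reading of what `ttest_1samp` computes.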
Plots
plt.figure(figsize=(10, 6))
gdf['area'].hist(bins=100, edgecolor='black')
plt.title('Distribution of Building Footprint Areas')
plt.xlabel('Area (meters squared)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
# The large peak on the left indicates that most buildings in the study area are quite small.
# The long tail to the right shows that a relative few buildings are much larger.
# In other words, the distribution is heavily right-skewed, with a small number of large outliers.
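The heavy right tail can also be quantified rather than just eyeballed, for example with the sample skewness from scipy. A sketch on synthetic log-normal data standing in for the areas (the distribution parameters are ours, chosen only to mimic the shape):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
areas = rng.lognormal(mean=6.0, sigma=1.2, size=9436)  # synthetic stand-in for footprint areas

skewness = stats.skew(areas)
print(f"sample skewness: {skewness:.2f}")              # strongly positive => long right tail
print(f"median {np.median(areas):.0f} m^2 vs mean {areas.mean():.0f} m^2")
```

A mean well above the median is another quick signal of right skew, and it suggests reporting the median alongside the mean for data like this.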
plt.figure(figsize=(8, 6))
gdf['area'].plot(kind='box', vert=False, color='lightcoral')
plt.title('Boxplot of Building Footprint Areas')
plt.xlabel('Area (meters squared)')
plt.show()
# The boxplot visualizes the spread and outliers in the building-area data.
# The central 50% of building areas fall between the lower and upper quartiles,
# with many outliers visible on the upper end of the plot.
# These outliers are buildings with areas far above the upper quartile,
# and could represent genuinely large buildings or errors in the data.
# Such extreme values can heavily influence summary statistics like the mean.
NY Weather Data Analysis
# First, we will load in the NY Mesonet Weather data from the Excel file
# and do some basic data exploration to understand what we're working with
import seaborn as sns
from scipy import stats
import matplotlib.dates as mdates
from datetime import datetime
# Set the style for plots - using seaborn for nicer visualizations
plt.style.use('seaborn-v0_8')
sns.set(font_scale=1.2)
plt.rcParams['figure.figsize'] = (12, 7)
# Read the Excel file containing data for both Bronx and Manhattan
NY_weather_xlsx = '../training_data/NY_Mesonet_Weather.xlsx'
bronx_df = pd.read_excel(NY_weather_xlsx, sheet_name="Bronx")
manhattan_df = pd.read_excel(NY_weather_xlsx, sheet_name="Manhattan")
# Let's look at the basic info to understand what we're working with
print("Dataset Information:")
print(f"Bronx data: {bronx_df.shape[0]} rows, {bronx_df.shape[1]} columns")
print(f"Manhattan data: {manhattan_df.shape[0]} rows, {manhattan_df.shape[1]} columns")
print("\nBronx data - first 5 rows:")
print(bronx_df.head())
Dataset Information:
Bronx data: 169 rows, 6 columns
Manhattan data: 169 rows, 6 columns
Bronx data - first 5 rows:
Date / Time Air Temp at Surface [degC] \
0 2021-07-24 06:00:00 EDT 19.3
1 2021-07-24 06:05:00 EDT 19.4
2 2021-07-24 06:10:00 EDT 19.3
3 2021-07-24 06:15:00 EDT 19.4
4 2021-07-24 06:20:00 EDT 19.4
Relative Humidity [percent] Avg Wind Speed [m/s] \
0 88.2 0.8
1 87.9 0.8
2 87.6 0.7
3 87.4 0.5
4 87.0 0.2
Wind Direction [degrees] Solar Flux [W/m^2]
0 335 12
1 329 18
2 321 25
3 307 33
4 301 42
# We rename the columns to make them easier to work
# with throughout the analysis
column_mapping = {
"Date / Time": "datetime",
"Air Temp at Surface [degC]": "temperature",
"Relative Humidity [percent]": "humidity",
"Avg Wind Speed [m/s]": "wind_speed",
"Wind Direction [degrees]": "wind_direction",
"Solar Flux [W/m^2]": "solar_flux"
}
bronx_df = bronx_df.rename(columns=column_mapping)
manhattan_df = manhattan_df.rename(columns=column_mapping)
# Looking at the datetime column, we need to convert it to actual datetime objects
# This will make time-based analysis much easier
bronx_df['datetime'] = pd.to_datetime(bronx_df['datetime'].str.replace(' EDT', ''))
manhattan_df['datetime'] = pd.to_datetime(manhattan_df['datetime'].str.replace(' EDT', ''))
# We extract just the hour information for hourly analysis
bronx_df['hour'] = bronx_df['datetime'].dt.hour
manhattan_df['hour'] = manhattan_df['datetime'].dt.hour
# Add location identifier to each dataframe so we know which is which
bronx_df['location'] = 'Bronx'
manhattan_df['location'] = 'Manhattan'
# Now we check for any missing values or other data quality issues,
# and look at the timespan of our data
# Create a combined dataset for easier comparison later
combined_df = pd.concat([bronx_df, manhattan_df], ignore_index=True)
# Check for missing values
print("\nMissing Values Check:")
print(f"Bronx: {bronx_df.isnull().sum().sum()} missing values")
print(f"Manhattan: {manhattan_df.isnull().sum().sum()} missing values")
# Get time range of the data
earliest = bronx_df['datetime'].min()
latest = bronx_df['datetime'].max()
print("\nTime Range:")
print(f"Earliest datetime: {earliest}")
print(f"Latest datetime: {latest}")
# We have no missing values, and the data covers about a 14-hour period
# on July 24, 2021, from around 6 AM to 8 PM.
Missing Values Check:
Bronx: 0 missing values
Manhattan: 0 missing values

Time Range:
Earliest datetime: 2021-07-24 06:00:00
Latest datetime: 2021-07-24 20:00:00
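Stripping " EDT" leaves the timestamps timezone-naive, which is fine for a single-day, single-zone analysis. If timezone-aware handling were ever needed (say, to join against UTC-stamped satellite data), `tz_localize` could be applied after parsing; a sketch, not used in the analysis above:

```python
import pandas as pd

raw = pd.Series(['2021-07-24 06:00:00 EDT', '2021-07-24 20:00:00 EDT'])

# Drop the abbreviation, parse, then attach the IANA zone so DST is handled correctly
naive = pd.to_datetime(raw.str.replace(' EDT', '', regex=False))
aware = naive.dt.tz_localize('America/New_York')
print(aware.iloc[0])   # 2021-07-24 06:00:00-04:00
```

With a proper zone attached, conversions like `aware.dt.tz_convert('UTC')` become one-liners.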
# Now let's analyze the Urban Heat Island effect by comparing temperatures
# between Manhattan and the Bronx throughout the day
# First, we group by hour and calculate the average temperature for each location
bronx_hourly = bronx_df.groupby('hour')['temperature'].mean()
manhattan_hourly = manhattan_df.groupby('hour')['temperature'].mean()
# Let's look at the hourly temperature values to understand the differences
print("\nHourly temperature values:")
for hour in range(6, 21):
if hour in bronx_hourly.index and hour in manhattan_hourly.index:
diff = manhattan_hourly[hour] - bronx_hourly[hour]
print(f"Hour {hour}: Bronx {bronx_hourly[hour]:.2f}°C, Manhattan {manhattan_hourly[hour]:.2f}°C, " +
f"Diff: {diff:.2f}°C")
# We can clearly see that Manhattan is warmer in the morning hours, and then
# the Bronx becomes warmer in the afternoon.
Hourly temperature values:
Hour 6: Bronx 19.39°C, Manhattan 21.69°C, Diff: 2.30°C
Hour 7: Bronx 20.09°C, Manhattan 22.50°C, Diff: 2.41°C
Hour 8: Bronx 21.67°C, Manhattan 23.44°C, Diff: 1.77°C
Hour 9: Bronx 23.61°C, Manhattan 24.36°C, Diff: 0.75°C
Hour 10: Bronx 24.85°C, Manhattan 24.88°C, Diff: 0.03°C
Hour 11: Bronx 25.86°C, Manhattan 25.48°C, Diff: -0.38°C
Hour 12: Bronx 26.46°C, Manhattan 26.34°C, Diff: -0.12°C
Hour 13: Bronx 26.94°C, Manhattan 27.23°C, Diff: 0.29°C
Hour 14: Bronx 27.48°C, Manhattan 27.31°C, Diff: -0.18°C
Hour 15: Bronx 27.52°C, Manhattan 26.73°C, Diff: -0.78°C
Hour 16: Bronx 26.86°C, Manhattan 26.84°C, Diff: -0.02°C
Hour 17: Bronx 26.09°C, Manhattan 25.72°C, Diff: -0.37°C
Hour 18: Bronx 25.22°C, Manhattan 25.18°C, Diff: -0.04°C
Hour 19: Bronx 25.01°C, Manhattan 25.10°C, Diff: 0.09°C
Hour 20: Bronx 24.90°C, Manhattan 24.60°C, Diff: -0.30°C
# Let's calculate some key metrics to quantify the heat island effect
# Diurnal temperature range (DTR) is the difference between daily max and min temps
bronx_dtr = bronx_hourly.max() - bronx_hourly.min()
manhattan_dtr = manhattan_hourly.max() - manhattan_hourly.min()
# For the analysis, we want to look at different periods throughout the day
# Morning (6-9 AM), Midday (10 AM - 1 PM), Afternoon (2-5 PM), Evening (6-8 PM)
# Define hour ranges for each period
morning_hours = range(6, 10)
midday_hours = range(10, 14)
afternoon_hours = range(14, 18)
evening_hours = range(18, 21)
# Filter data for each period
bronx_morning = bronx_df[bronx_df['hour'].isin(morning_hours)]
manhattan_morning = manhattan_df[manhattan_df['hour'].isin(morning_hours)]
bronx_midday = bronx_df[bronx_df['hour'].isin(midday_hours)]
manhattan_midday = manhattan_df[manhattan_df['hour'].isin(midday_hours)]
bronx_afternoon = bronx_df[bronx_df['hour'].isin(afternoon_hours)]
manhattan_afternoon = manhattan_df[manhattan_df['hour'].isin(afternoon_hours)]
bronx_evening = bronx_df[bronx_df['hour'].isin(evening_hours)]
manhattan_evening = manhattan_df[manhattan_df['hour'].isin(evening_hours)]
# Calculate mean temperatures for each period
morning_diff = manhattan_morning['temperature'].mean() - bronx_morning['temperature'].mean()
midday_diff = manhattan_midday['temperature'].mean() - bronx_midday['temperature'].mean()
afternoon_diff = manhattan_afternoon['temperature'].mean() - bronx_afternoon['temperature'].mean()
evening_diff = manhattan_evening['temperature'].mean() - bronx_evening['temperature'].mean()
# Now we verify our calculations to make sure they're correct
print("\nTemperature differences by time period (Manhattan - Bronx):")
print(f"Morning (6-9 AM): {morning_diff:.2f}°C")
print(f"Midday (10 AM - 1 PM): {midday_diff:.2f}°C")
print(f"Afternoon (2-5 PM): {afternoon_diff:.2f}°C")
print(f"Evening (6-8 PM): {evening_diff:.2f}°C")
Temperature differences by time period (Manhattan - Bronx):
Morning (6-9 AM): 1.81°C
Midday (10 AM - 1 PM): -0.04°C
Afternoon (2-5 PM): -0.34°C
Evening (6-8 PM): 0.01°C
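The four period filters above can be written more compactly by binning hours with `pd.cut` and grouping on the result. A sketch on toy data (the toy temperatures are made up; the bin edges mirror our period definitions, using right-closed intervals):

```python
import pandas as pd

toy = pd.DataFrame({
    'hour': range(6, 21),
    'temperature': [19, 20, 22, 24, 25, 26, 26, 27, 27, 28, 27, 26, 25, 25, 25],
})

# Right-closed bins: (5,9]=morning, (9,13]=midday, (13,17]=afternoon, (17,20]=evening
toy['period'] = pd.cut(toy['hour'], bins=[5, 9, 13, 17, 20],
                       labels=['morning', 'midday', 'afternoon', 'evening'])
print(toy.groupby('period', observed=True)['temperature'].mean())
```

Applied to `bronx_df` and `manhattan_df`, this would replace the eight `isin` filters with one labeled column per frame, and the per-period differences fall out of a single subtraction of two grouped means.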
# We run a statistical test to see if the morning temperature difference
# is statistically significant. We'll use a two-sample t-test.
t_stat_morning, p_value_morning = stats.ttest_ind(
bronx_morning['temperature'],
manhattan_morning['temperature'],
equal_var=False # Using Welch's t-test since variances might differ
)
print("\n=== Urban Heat Island Effect Analysis ===")
print(f"Bronx diurnal temperature range: {bronx_dtr:.2f}°C")
print(f"Manhattan diurnal temperature range: {manhattan_dtr:.2f}°C")
print("\nStatistical test for morning temperature difference:")
print(f"t-statistic: {t_stat_morning:.4f}")
print(f"p-value: {p_value_morning:.10f}")
if p_value_morning < 0.05:
print("Result: Statistically significant difference in morning temperatures (p < 0.05)")
print(f"The morning temperature difference of +{morning_diff:.2f}°C is statistically significant!")
else:
print("Result: No statistically significant difference in morning temperatures")
# The p-value is extremely small (much less than 0.05), which means
# the temperature difference we observed is very unlikely to be due to chance.
# This is strong evidence of the urban heat island effect.
=== Urban Heat Island Effect Analysis ===
Bronx diurnal temperature range: 8.12°C
Manhattan diurnal temperature range: 5.62°C

Statistical test for morning temperature difference:
t-statistic: -6.2925
p-value: 0.0000000163
Result: Statistically significant difference in morning temperatures (p < 0.05)
The morning temperature difference of +1.81°C is statistically significant!
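Welch's statistic can likewise be verified by hand: t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂), where the variances are not pooled. A sketch on synthetic morning temperatures (the means, spreads, and sample sizes are illustrative, not the Mesonet readings):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
bronx = rng.normal(21.2, 1.5, size=48)       # synthetic morning readings
manhattan = rng.normal(23.0, 1.2, size=48)

# Welch standard error: variances kept separate, not pooled
se = np.sqrt(bronx.var(ddof=1) / len(bronx) + manhattan.var(ddof=1) / len(manhattan))
t_manual = (bronx.mean() - manhattan.mean()) / se
t_scipy, _ = stats.ttest_ind(bronx, manhattan, equal_var=False)
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}")
```

The agreement confirms that `equal_var=False` switches `ttest_ind` to exactly this unpooled formula.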
# Now we'll create a visualization to show the urban heat island effect
# We want to make a figure that shows both the hourly temperatures and
# the differences between locations
plt.figure(figsize=(14, 9))
# Main plot: Hourly temperatures
ax1 = plt.subplot2grid((3, 3), (0, 0), colspan=3, rowspan=2)
ax1.plot(bronx_hourly.index, bronx_hourly.values, 'o-', color='#1f77b4', linewidth=3, label='Bronx', markersize=8)
ax1.plot(manhattan_hourly.index, manhattan_hourly.values, 'o-', color='#ff7f0e', linewidth=3, label='Manhattan', markersize=8)
# Shade the morning period to highlight UHI effect
ax1.axvspan(6, 9, alpha=0.15, color='green', label='Morning Hours')
# Add annotations for key times
bronx_max_hour = bronx_hourly.idxmax()
manhattan_max_hour = manhattan_hourly.idxmax()
ax1.annotate(f'Peak: {bronx_hourly.max():.1f}°C',
xy=(bronx_max_hour, bronx_hourly.max()),
xytext=(bronx_max_hour+0.5, bronx_hourly.max()+0.5),
arrowprops=dict(arrowstyle='->', color='#1f77b4'),
color='#1f77b4')
ax1.annotate(f'Peak: {manhattan_hourly.max():.1f}°C',
xy=(manhattan_max_hour, manhattan_hourly.max()),
xytext=(manhattan_max_hour+0.5, manhattan_hourly.max()+0.5),
arrowprops=dict(arrowstyle='->', color='#ff7f0e'),
color='#ff7f0e')
# Annotate morning difference
ax1.annotate(f'Morning Difference: +{morning_diff:.2f}°C\n(p < 0.0001)',
xy=(7.5, (bronx_hourly[7] + manhattan_hourly[7])/2),
xytext=(7.5, (bronx_hourly[7] + manhattan_hourly[7])/2 - 1.5),
ha='center',
bbox=dict(boxstyle="round,pad=0.5", fc="white", ec="gray", alpha=0.8),
arrowprops=dict(arrowstyle='->', color='black'))
# Add labels to main graph
ax1.set_title('Urban Heat Island Effect: Hourly Temperature Comparison', fontsize=18, fontweight='bold', pad=20)
ax1.set_xlabel('Hour of Day', fontsize=14)
ax1.set_ylabel('Average Temperature (°C)', fontsize=14)
ax1.set_xticks(range(6, 21))
ax1.grid(True, linestyle='--', alpha=0.7)
ax1.legend(fontsize=12)
# Subplot: Temperature difference
ax2 = plt.subplot2grid((3, 3), (2, 0), colspan=2)
temp_diff = manhattan_hourly - bronx_hourly
bars = ax2.bar(temp_diff.index, temp_diff.values, color=['green' if val > 0 else 'blue' for val in temp_diff])
# Color the bars by time period
for i, bar in enumerate(bars):
hour = temp_diff.index[i]
if 6 <= hour <= 9: # Morning
bar.set_color('#8cc751') # Light green
elif 10 <= hour <= 13: # Midday
bar.set_color('#f5b642') # Light orange
elif 14 <= hour <= 17: # Afternoon
bar.set_color('#f58d42') # Darker orange
else: # Evening
bar.set_color('#4286f5') # Blue
ax2.axhline(y=0, color='black', linestyle='-', alpha=0.3)
ax2.set_xlabel('Hour of Day', fontsize=12)
ax2.set_ylabel('Temperature Difference\n(Manhattan - Bronx, °C)', fontsize=12)
ax2.set_xticks(range(6, 21))
ax2.grid(True, axis='y', linestyle='--', alpha=0.7)
# Subplot: Key statistics and findings
ax3 = plt.subplot2grid((3, 3), (2, 2))
ax3.axis('off') # Turn off axis
# Add text box with key statistics
stats_text = (
"Urban Heat Island Effect:\n"
"-------------------------\n"
f"Morning diff: +{morning_diff:.2f}°C\n"
f"Midday diff: {midday_diff:.2f}°C\n"
f"Afternoon diff: {afternoon_diff:.2f}°C\n"
f"Evening diff: {evening_diff:.2f}°C\n\n"
f"DTR Bronx: {bronx_dtr:.2f}°C\n"
f"DTR Manhattan: {manhattan_dtr:.2f}°C\n\n"
f"t-test: {t_stat_morning:.2f}\n"
f"p-value: <0.0001"
)
ax3.text(0, 1, stats_text, fontsize=10, va='top',
bbox=dict(boxstyle="round,pad=0.5", fc="#f0f0f0", ec="gray"))
plt.tight_layout()
plt.show()
Conclusion
Our testing indicates significant microclimatic differences between the Bronx and Manhattan. Manhattan exhibits classic urban heat island characteristics, with notably warmer morning temperatures (p < 0.0001) and reduced daily temperature fluctuations compared to the Bronx. This temperature difference is most pronounced during morning hours but diminishes throughout the day, even slightly reversing by afternoon. The data suggests that Manhattan's dense urban landscape retains more overnight heat, while the Bronx may experience greater daytime warming due to less building shade. These patterns align with established urban climatology research on how building density, construction materials, and vegetation coverage influence local temperature variations.
UHI Data Analysis
UHI_df = pd.read_csv(UHI_data_csv)
print(UHI_df.head())
   Longitude   Latitude          datetime  UHI Index
0 -73.909167  40.813107  24-07-2021 15:53   1.030289
1 -73.909187  40.813045  24-07-2021 15:53   1.030289
2 -73.909215  40.812978  24-07-2021 15:53   1.023798
3 -73.909242  40.812908  24-07-2021 15:53   1.023798
4 -73.909257  40.812845  24-07-2021 15:53   1.021634
UHI_df.dtypes
Longitude    float64
Latitude     float64
datetime      object
UHI Index    float64
dtype: object
# We see that this CSV file has four features: Longitude, Latitude, datetime, and
# UHI Index. However, the datetime column does not contain datetime objects, so
# we must first convert the column to hold datetime objects.
UHI_df['datetime'] = pd.to_datetime(UHI_df['datetime'], format="%d-%m-%Y %H:%M")
UHI_df.dtypes
print(UHI_df.head())
   Longitude   Latitude            datetime  UHI Index
0 -73.909167  40.813107 2021-07-24 15:53:00   1.030289
1 -73.909187  40.813045 2021-07-24 15:53:00   1.030289
2 -73.909215  40.812978 2021-07-24 15:53:00   1.023798
3 -73.909242  40.812908 2021-07-24 15:53:00   1.023798
4 -73.909257  40.812845 2021-07-24 15:53:00   1.021634
# Looking through the data, all of the timestamps appear to fall on the same date.
# To confirm, we print the earliest and latest datetimes below.
earliest = UHI_df['datetime'].min()
latest = UHI_df['datetime'].max()
print("Earliest datetime:", earliest)
print("Latest datetime:", latest)
Earliest datetime: 2021-07-24 15:01:00
Latest datetime: 2021-07-24 15:59:00
# We see that all of the data was gathered within a single hour, so this column
# provides no meaningful information for modeling. Therefore we drop it, but we
# keep in mind that the UHI values for these latitudes and longitudes were
# recorded on July 24th, a very warm time of year in New York.
UHI_df = UHI_df.drop(columns=['datetime'])
print(UHI_df.head())
   Longitude   Latitude  UHI Index
0 -73.909167  40.813107   1.030289
1 -73.909187  40.813045   1.030289
2 -73.909215  40.812978   1.023798
3 -73.909242  40.812908   1.023798
4 -73.909257  40.812845   1.021634
print(UHI_df.count())
Longitude    11229
Latitude     11229
UHI Index    11229
dtype: int64
# The code above shows that each of the three columns has 11229 non-null entries,
# so there are no missing values in the data. We use describe() below to get a
# better statistical picture of the data.
summary = UHI_df.describe()
print(summary)
          Longitude      Latitude     UHI Index
count  11229.000000  11229.000000  11229.000000
mean     -73.933927     40.808800      1.000001
std        0.028253      0.023171      0.016238
min      -73.994457     40.758792      0.956122
25%      -73.955703     40.790905      0.988577
50%      -73.932968     40.810688      1.000237
75%      -73.909647     40.824515      1.011176
max      -73.879458     40.859497      1.046036
The descriptive statistics show some interesting trends. The Latitude and Longitude values span a very narrow range, which makes sense since the data covers only upper Manhattan and part of the Bronx. What surprised us is that the UHI Index also has a relatively small range. It is centered almost exactly at 1 (mean 1.000001), and since the median is close to the mean, the UHI data is fairly symmetrically distributed.
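The mean landing so close to 1 is consistent with our reading of the index (an assumption based on the challenge description, not stated in the CSV itself): each value is a point's temperature divided by the citywide mean at that time, which forces the average index to 1 by construction. A sketch on synthetic temperatures:

```python
import numpy as np

rng = np.random.default_rng(3)
temps = rng.normal(30.0, 0.5, size=11229)   # synthetic ground temperatures (degC)

uhi_index = temps / temps.mean()            # each point's ratio to the citywide mean
print(f"mean index: {uhi_index.mean():.6f}")
print(f"range: {uhi_index.min():.3f} to {uhi_index.max():.3f}")
```

Under this construction, values above 1 mark hotter-than-average locations and values below 1 mark cooler ones, which is how we interpret the 0.956–1.046 range in the real data.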
# We can get a better visual understanding of the data by plotting it. Below,
# we plot the Latitude and Longitude values to see where the data was collected.
# To put the points in context, we plot them on top of a map of Manhattan.
import folium
from IPython.display import HTML
# Create a map centered around Manhattan
manhattan_map = folium.Map(location=[40.81, -73.9402778], zoom_start=12) # Center found with trial and error
# Add points to the map
for _, row in UHI_df.iterrows():
folium.CircleMarker(
location=[row['Latitude'], row['Longitude']],
radius=5,
color='blue',
fill=True,
fill_color='blue',
fill_opacity=0.5
).add_to(manhattan_map)
iframe = manhattan_map._repr_html_()
display(HTML('<div style="width:800px; height:500px">{}</div>'.format(iframe)))
manhattan_map.save('maps/uhi_locations.html')
View Manhattan UHI Locations Map if Above is not Loading
The map shows that the data was collected only in upper Manhattan, starting around West 57th Street (near the bottom of Central Park) and extending up into the Bronx. This is interesting, as none of Midtown or lower Manhattan was captured.
# We only plotted the Latitude and Longitude values above. There is no UHI metric
# shown, so the heatmap below shows the UHI indices for each Latitude and Longitude
from folium.plugins import HeatMap
from branca.colormap import LinearColormap
manhattan_map = folium.Map(location=[40.81, -73.9402778], zoom_start=12)
# Create a custom color map for the small range of UHI values
min_uhi = 0.956122
max_uhi = 1.046036
colors = ['blue', 'lightblue', 'yellow', 'orange', 'red']
colormap = LinearColormap(colors, vmin=min_uhi, vmax=max_uhi)
colormap.caption = 'UHI Index'
# Add the color map to the map
colormap.add_to(manhattan_map)
# Add points to the map with colors based on UHI Index
for _, row in UHI_df.iterrows():
folium.CircleMarker(
location=[row['Latitude'], row['Longitude']],
radius=5,
color=colormap(row['UHI Index']), # Use the same color for border as fill
weight=0, # Set weight to 0 to remove the border
fill=True,
fill_color=colormap(row['UHI Index']),
fill_opacity=0.7,
popup=f"UHI Index: {row['UHI Index']:.6f}"
).add_to(manhattan_map)
iframe = manhattan_map._repr_html_()
display(HTML('<div style="width:800px; height:500px">{}</div>'.format(iframe)))
manhattan_map.save('maps/uhi_heatmap.html')
# Note: There is text below this section, but sometimes it does not render when
# viewing on GitHub. To see the next section, download this notebook and view
# it locally.